Part 3. Learning from random samples (part a)
So far: Given a known random process (the known contents of the urn), what will we observe? (Probability.)
Now: We switch to statistics – we try to figure out what is in a population (the urn) from a sample.
We will still sometimes assume we know what is in the urn so that we can evaluate our procedures.
Suppose we want to measure a characteristic of a large population (e.g. average concern about climate change in US on 0-4 scale).
We contact a sample of size \(n = 1000\).
Let \(X_i\) denote the response of the \(i\)th person we contact (so we have \(X_1, X_2, \ldots, X_n\)).
Is \(X_1\) a random variable? What is its PMF/PDF? And what about \(X_2, \ldots, X_n\)?
A & M call the PMF of a RV sampled from a population the finite population mass function \(f_{FP}(x)\).
Our sample \(X_1, X_2, \ldots, X_n\) is independent and identically distributed (IID) if the \(X_i\) are mutually independent and each has the same distribution (here, the finite population mass function \(f_{FP}\)).
Then \(X_1, X_2, \ldots, X_n\) can be thought of as \(n\) samples from a single RV \(X\).
Are these sampling approaches IID?
IID is an approximation that lets us treat \(X_1, X_2, \ldots, X_n\) as repeated samples from the same RV \(X\). We will use it.
Definition 3.2.1 Sample statistic
For IID random variables \(X_1, X_2, \ldots, X_n\), a sample statistic \(T_{(n)}\) is a function of \(X_1, X_2, \ldots, X_n\):
\[T_{(n)} = h_{(n)}(X_1, X_2, \ldots, X_n)\]
where \(h_{(n)}: \mathbb{R}^n \rightarrow \mathbb{R}, \forall n \in \mathbb{N}\).
Examples of sample statistics: sample mean, sample variance, sample covariance, regression coefficient
Because sample statistics are functions of random variables, they are themselves random variables (cf. the population mean, which is a fixed number).
For i.i.d. random variables \(X_1, X_2, \ldots, X_n\), the sample mean is
\[\overline X = \frac{X_1 + X_2 + \ldots + X_n}{n} = \frac{1}{n} \sum_{i = 1}^{n} X_i\]
\(\overline{X}\) is a RV (and a sample statistic). Let’s summarize its distribution!
Proof that \(E[\overline{X}] = E[X]\) (Theorem 3.2.3):
\[\begin{align}
{\textrm E}[\overline{X}] &= {\textrm E}\left[\frac{1}{n}(X_1 + X_2 + \ldots + X_n) \right] \\
&= \frac{1}{n} {\textrm E}\left[X_1 + X_2 + \ldots + X_n \right] \\
&= \frac{1}{n} \left( {\textrm E}[X_1] + {\textrm E}[X_2] + \ldots + {\textrm E}[X_n] \right) \\
&= \frac{1}{n} \left( n {\textrm E}[X] \right) = {\textrm E}[X]
\end{align}\]
# E[X], exactly, from the PMF (for comparison)
[1] 2.1
# E[X], numerically: mean of a large sample from f(x)
mean(sample(x = xs, size = 10000, replace = T, prob = probs))
[1] 2.0942
# E[\overline{X}], numerically: mean of many sample means
storage <- rep(NA, 10000)
for(i in 1:length(storage)){
  storage[i] <- mean(sample(x = xs, size = 10, replace = T, prob = probs))
}
mean(storage)
[1] 2.10102
Okay, so \({\textrm E}[\overline{X}] = {\textrm E}[X]\). What else can we say about its distribution?
How close will \(\overline{X}\) be to \({\textrm E}[X]\)?
One measure of potential (in)accuracy is \({\textrm V}[\overline{X}]\).
Theorem 3.2.4 says \({\textrm V}[\overline{X}] = \frac{{\textrm V}[X]}{n}.\) (See homework.)
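Theorem 3.2.4 can be checked by simulation. A minimal sketch, assuming a Bernoulli \(X\) with \(p = 1/2\) (so \({\textrm V}[X] = 1/4\)); the variable names here are ours, not the slides':

```r
# Simulate V[Xbar] for Bernoulli(1/2), where V[X] = p(1 - p) = 1/4
set.seed(1)
n <- 100
xbars <- replicate(10000, mean(rbinom(n, size = 1, prob = 1/2)))
var(xbars)  # should be close to V[X]/n = 0.25/100 = 0.0025
```

Doubling `n` should roughly halve `var(xbars)`.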
What does this mean?
If \({\textrm E}[\overline{X}] = {\textrm E}[X]\), and \({\textrm V}[\overline{X}] = \frac{{\textrm V}[X]}{n}\), then with large \(n\) isn’t \(\overline{X}\) likely to give us something very close to \({\textrm E}[X]\)?
Yes! That’s what the weak law of large numbers says.
Theorem 3.2.8 Let \(X_1, X_2, \ldots, X_n\) be i.i.d. random variables with finite variance \(\text{V}[X] > 0\), and let \(\overline{X}_{(n)} = \frac{1}{n} \sum_{i=1}^n X_i\). Then
\[\overline{X}_{(n)} \overset{p}{\to} \text{E}[X]\]
(where \(\overset{p}{\to}\) means “convergence in probability”, as \(n\) increases)
Usually you don’t know \({\textrm E}[X]\), but WLLN tells us that if \(n\) is large then \(\overline{X}\) is probably close to it.
Bernoulli random variable, \(p = 2/3\).
If we take a large sample (e.g. \(n = 1000\)), we can show the sample mean at each value of \(n\) from \(20, 21, \ldots, 1000\).
On the next slides we show the results of doing this once, ten times, etc:
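A sketch of one such run, assuming Bernoulli draws with \(p = 2/3\) (names are illustrative):

```r
# Running sample mean for 1000 Bernoulli(2/3) draws
set.seed(42)
x <- rbinom(1000, size = 1, prob = 2/3)
running_mean <- cumsum(x) / seq_along(x)
plot(20:1000, running_mean[20:1000], type = "l",
     xlab = "n", ylab = "sample mean")
abline(h = 2/3, lty = 2)  # E[X]
```

The running mean wanders early and settles near \(2/3\) as \(n\) grows.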
“WLLN says ‘With more \(n\), \(\overline{X}\) should get closer and closer to \({\textrm E}[X]\).’ The roulette ball hasn’t landed on red in a while. By the WLLN, I know that the ball is now especially likely to land on red.”
What is the gambler missing?
Estimating sample means is boring. When do we get to the good stuff? Prediction, causal inference, regression, machine learning, deep learning, etc.
But remember that many quantities of interest (probabilities, variances, covariances, regression coefficients) can be written in terms of population means.
So it really is all about estimating sample means! This is the “plug-in principle”.
Which one is \({\textrm E}[X]\)?
Which one is \(\overline{X}\)?
Sampling distribution of an estimator: The distribution of \(\hat{\theta}\) (over repeated samples), as summarized by PMF/PDF \(f(\hat{\theta})\) or CDF \(F(\hat{\theta})\)
Bias of an estimator: The bias of an estimator \(\hat{\theta}\) is \({\textrm E}[\hat{\theta}] - \theta\).
If \({\textrm E}[\hat{\theta}] = \theta\), \(\hat{\theta}\) is unbiased.
Sampling variance of an estimator: The sampling variance of an estimator \(\hat{\theta}\) is \({\textrm V}[\hat{\theta}]\).
Standard error of an estimator: The standard error of an estimator \(\hat{\theta}\) is \(\sigma[\hat{\theta}] = \sqrt{{\textrm V}[\hat{\theta}]}\).
Plug-in principle: “Write down the feature of the population that we are interested in, and then use the sample analog to estimate it” (A&M, page 116)
For example, to estimate the population mean \({\textrm E}[X]\), use the sample mean \(\overline{X}\).
Above we established that \({\textrm V}[\overline{X}] = \frac{{\textrm V}[X]}{n}\). (Remember what this means?)
But we never know \({\textrm V}[X]\), the population variance of \(X\).
So how can we estimate \({\textrm V}[X]\) from the sample? Plug-in principle!
Estimand in terms of expectations: \[{\textrm V}[X] = {\textrm E}[X^2] - {\textrm E}[X]^2\]
Estimator in terms of sample means: \[\hat{\text{V}}_{\text{plug-in}}[X] = \overline{X^2} - \overline{X}^2\]
Could also write it as \(\overline{(X - \overline{X})^2}\) (algebraically the same).
Our plug-in sample variance estimator: \(\hat{\text{V}}_{\text{plug-in}}[X] = \overline{X^2} - \overline{X}^2\)
Suppose our sample is samp below:
How would we compute the plug-in sample variance?
R’s var() function gives us a different answer:
Plug-in estimator: [1] 2.77551
var(samp): [1] 3.238095
Why? The plug-in sample variance estimator is biased (especially in small samples), and R’s var() function corrects for this.
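The correction is a factor of \(n/(n-1)\). A sketch with made-up data (not the slides' `samp`):

```r
samp <- c(1, 2, 3, 4, 5)                  # illustrative data
n <- length(samp)
v_plugin <- mean(samp^2) - mean(samp)^2   # plug-in estimator: 2
n / (n - 1) * v_plugin                    # = var(samp) = 2.5
```

Multiplying the plug-in estimate by \(n/(n-1)\) reproduces `var()` exactly.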
Why is it biased?
Suppose Bernoulli RV \(X\) with \(p = 1/2\).
What is \({\textrm V}[X]\)?
Taking samples of size \(n = 2\), the possible samples are:
| \((x_1, x_2)\) | \(\overline{x}\) | \(f(\overline{x})\) | \(\hat{V}_{\text{plug-in}}[X] = \overline{(x - \overline{x})^2}\) |
|---|---|---|---|
| (0, 0) | 0 | 1/4 | 0 |
| (0,1) or (1,0) | 1/2 | 1/2 | 1/4 |
| (1,1) | 1 | 1/4 | 0 |
So \({\textrm E}\left[\hat{V}_{\text{plug-in}}[X]\right] = 0 \times 1/2 + 1/4 \times 1/2 = 1/8 < {\textrm V}[X] = 1/4\).
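The table's expectation can be checked by enumerating all four equally likely samples in R (a sketch; names are ours):

```r
# All samples of size 2 from Bernoulli(1/2), each with probability 1/4
samples <- expand.grid(x1 = 0:1, x2 = 0:1)
v_plugin <- apply(samples, 1, function(s) mean((s - mean(s))^2))
mean(v_plugin)  # E[V-hat] = 1/8, below V[X] = 1/4
```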
To see how biased (and how to correct):
\[\begin{align} {\textrm E}\left[\hat{\text{V}}_{\text{plug-in}}[X]\right] &= {\textrm E}[\overline{X^2} - \overline{X}^2] \\ &= {\textrm E}[\overline{X^2}] - {\textrm E}[\overline{X}^2] \\ &= {\textrm E}[\overline{X^2}] - \left({\textrm E}[\overline{X}]^2 + {\textrm V}[\overline{X}]\right) \tag{*Def} \\ &= {\textrm E}[X^2] - {\textrm E}[X]^2 - \frac{{\textrm V}[X]}{n} \\ &= \overbrace{{\textrm V}[X]}^{\text{target}} - \overbrace{\frac{{\textrm V}[X]}{n}}^{\text{variance of }\overline{X}} \\ &= \frac{n - 1}{n} {\textrm V}[X] \end{align}\]
*Def: \({\textrm V}[\overline{X}] = {\textrm E}[\overline{X}^2] - {\textrm E}[\overline{X}]^2\)
Plan:
Common situation:
We’ll do this again!
| Estimand | Estimator | Biased? |
|---|---|---|
| Pop. mean, \({\textrm E}[X]\) | \(\overline{X}\) | \({\textrm E}[\overline{X}] = {\textrm E}[X]\) |
| Pop. variance, \({\textrm V}[X]\) | \(\hat{{\textrm V}}_{\text{plug-in}}[X]\) | \({\textrm E}\left[\hat{{\textrm V}}_{\text{plug-in}}[X]\right] = \frac{n-1}{n} {\textrm V}[X]\) |
| Pop. variance, \({\textrm V}[X]\) | \(\hat{{\textrm V}}[X]\) | \({\textrm E}[\hat{{\textrm V}}[X]] = {\textrm V}[X]\) |
| (Sampling) var. of sample mean, \({\textrm V}[\overline{X}]\) | \(\frac{\hat{{\textrm V}}[X]}{n}\) | \({\textrm E}[\hat{{\textrm V}}[\overline{X}]] = {\textrm V}[\overline{X}]\) |
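The last row of the table in R, on made-up data (names are illustrative):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)      # illustrative sample
v_hat <- var(x)                      # unbiased V-hat[X] (n - 1 denominator)
v_hat / length(x)                    # estimated V[Xbar]
sqrt(v_hat / length(x))              # estimated standard error of Xbar
```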
What do we know about the sample mean \(\overline{X}\) so far?
Can we say more about \(\overline{X}\)’s sampling distribution?
For example, what is \(\text{Pr}\left[\overline{X} - {\textrm E}[X] > c\right]\) for some \(c\)?
Consider Bernoulli random variable \(X\):
\[ f(x) = \begin{cases} 1/2 & x = 0 \\ 1/2 & x = 1 \\ 0 & \text{otherwise} \end{cases} \] (Equivalently, large population with equal number of 1s and 0s.)
If we draw 10,000 samples of size \(n\) and record the sample mean each time, what will the distribution of these sample means look like? (The sampling distribution of the sample mean.)
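A sketch of that simulation, assuming \(n = 100\) (names are ours):

```r
# 10,000 sample means of n = 100 Bernoulli(1/2) draws
set.seed(7)
xbars <- replicate(10000, mean(rbinom(100, size = 1, prob = 1/2)))
hist(xbars, breaks = 30)  # bell-shaped, centered at E[X] = 1/2
```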
Theorem 3.2.24 Central limit theorem
Let \(X_1, X_2, \ldots, X_n\) be i.i.d. random variables with finite \({\textrm E}[X] = \mu\) and finite \({\textrm V}[X] = \sigma^2 > 0\). Then
\[ \overline{X} \overset{d}{\to} N\left(\mu, \frac{\sigma^2}{n}\right),\] i.e. the normal distribution with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\).
You already know two parts of this, which are true for any sample size:
The new part is the shape: as \(n\) goes to \(\infty\), the sampling distribution of \(\overline{X}_{(n)}\) becomes more normal.
Theorem 3.2.24 Central limit theorem
Let \(X_1, X_2, \ldots, X_n\) be i.i.d. random variables with finite \({\textrm E}[X] = \mu\) and finite \({\textrm V}[X] = \sigma^2 > 0\). Then
\[\begin{align} \frac{\sqrt{n} \left(\overline{X} - \mu\right)}{\sigma} &\overset{d}{\to} N(0, 1) \tag{Version 1} \\ \overline{X} - \mu &\overset{d}{\to} \frac{\sigma}{\sqrt{n}} N(0, 1) \\ \overline{X} - \mu &\overset{d}{\to} N\left(0, \frac{\sigma^2}{n} \right) \\ \overline{X} &\overset{d}{\to} N\left(\mu, \frac{\sigma^2}{n} \right) \tag{Version 2} \end{align}\]
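Version 1 in simulation, assuming Bernoulli(1/2) draws so that \(\mu = 1/2\) and \(\sigma = 1/2\) (a sketch):

```r
# Standardized sample means should look like N(0, 1)
set.seed(3)
n <- 400; mu <- 1/2; sigma <- 1/2
z <- replicate(10000, sqrt(n) * (mean(rbinom(n, 1, 1/2)) - mu) / sigma)
c(mean(z), sd(z))  # close to 0 and 1
```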
Let \(X\) be a Bernoulli random variable (e.g. a coin flip).
Suppose \(n = 4\). How many ways are there to get a sample mean \(\overline{X}\) of 0, 1/4, 1/2, 3/4, or 1?
Generally, how many ways to get \(k\) successes in \(n\) trials?
\[{n \choose k} = \frac{n!}{k!(n - k)!} \]
Let’s compute the number of ways to get each number of heads between 0 and 1000 in 1000 tries:
Since each sequence of flips is equally likely, we can convert “number of ways” into “probability” by dividing by the total number of possible sequences.
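In R (a sketch; `choose()` computes \(\binom{n}{k}\)):

```r
k <- 0:1000
ways <- choose(1000, k)   # number of ways to get k heads in 1000 flips
probs <- ways / 2^1000    # divide by total number of equally likely sequences
which.max(probs) - 1      # most likely count: 500 heads
sum(probs)                # probabilities sum to 1
```

This is the same as `dbinom(k, size = 1000, prob = 1/2)`.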
For Bernoulli \(X\) (a very “un-normal” PMF!), \(\overline{X}\) is approximately normally distributed (with large \(n\)) because there are more ways to get a sample mean close to \({\textrm E}[X]\).
Extend that intuition to other \(X\)’s:
We focused on the sample mean, but (given “mild regularity conditions”) all plug-in estimators are asymptotically normal (Theorem 3.3.6).
(“Mild regularity conditions” means that small changes in the CDF produce only small changes in our sample statistic, technically a statistical functional.)
Intuition:
Some practically irrelevant exceptions implied by the CLT theorem statement:
Sample mean of \(X\) when \({\textrm E}[X]\) or \({\textrm V}[X]\) is not finite (e.g. the Cauchy distribution)
More practically relevant exceptions: